
    A Flexible Graph-Based Data Model Supporting Incremental Schema Design and Evolution

    Web data is characterized by great structural diversity as well as frequent changes, which poses a challenge for web applications based on that data. We address this problem by developing a schema-optional, flexible data model that supports the integration of heterogeneous and volatile web data. To this end, we rely on graph-based models that allow the schema to be incrementally extended with additional information and constraints. Inspired by the ongoing Web 2.0 trend, we want users to participate in the design and management of the schema. By incrementally adding structural information, users can evolve the schema to meet their specific requirements.
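
    To make the idea concrete, the following Python sketch shows one way such a schema-optional graph repository could look; the API is hypothetical and not the authors' implementation. Data is accepted without any schema, while user-contributed constraints are attached incrementally and checked lazily:

```python
# Minimal sketch of a schema-optional data model (hypothetical API, not the
# authors' implementation): nodes carry arbitrary attributes, and schema
# constraints are added incrementally and validated lazily.

class GraphRepository:
    def __init__(self):
        self.nodes = {}          # node_id -> {attribute: value}
        self.constraints = {}    # attribute -> list of validator functions

    def add_node(self, node_id, **attributes):
        # Schema-optional: any attribute combination is accepted on insert.
        self.nodes.setdefault(node_id, {}).update(attributes)

    def add_constraint(self, attribute, validator):
        # Incremental schema design: users contribute constraints over time.
        self.constraints.setdefault(attribute, []).append(validator)

    def violations(self):
        # Constraints are checked lazily, so existing data never blocks
        # schema evolution; violations are reported rather than rejected.
        for node_id, attrs in self.nodes.items():
            for attr, validators in self.constraints.items():
                if attr in attrs and not all(v(attrs[attr]) for v in validators):
                    yield node_id, attr, attrs[attr]

repo = GraphRepository()
repo.add_node("city:dresden", name="Dresden", population="556,000")
repo.add_constraint("population", lambda v: str(v).replace(",", "").isdigit())
print(list(repo.violations()))  # -> [] since "556,000" normalizes to digits
```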

    OPEN—Enabling Non-expert Users to Extract, Integrate, and Analyze Open Data

    Government initiatives for more transparency and participation have led to an increasing amount of structured data on the web in recent years. Many of these datasets have great potential. For example, a situational analysis and meaningful visualization of the data can assist in pointing out social or economic issues and raising people’s awareness. Unfortunately, the ad-hoc analysis of this so-called Open Data can prove very complex and time-consuming, partly due to a lack of efficient system support. On the one hand, search functionality is required to identify relevant datasets. Common document retrieval techniques used in web search, however, are not optimized for Open Data and do not address the semantic ambiguity inherent in it. On the other hand, semantic integration is necessary to perform analysis tasks across multiple datasets. Doing so in an ad-hoc fashion, however, requires more flexibility and easier integration than most data integration systems provide. It is apparent that an optimal management system for Open Data must combine aspects of both classic approaches. In this article, we propose OPEN, a novel concept for the management and situational analysis of Open Data within a single system. In our approach, we extend a classic database management system with support for the identification and dynamic integration of public datasets. As most web users lack the experience and training required to formulate structured queries in a DBMS, we add support for non-expert users to our system, for example through keyword queries. Furthermore, we address the challenge of indexing Open Data.
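
    As an illustration of the keyword-query layer for non-expert users, the sketch below indexes dataset metadata in a tiny inverted index; the data structures are assumptions for illustration, not the OPEN system's actual interface:

```python
# Illustrative sketch of keyword search over Open Data metadata (assumed
# structures, not the OPEN system's interface): datasets are indexed by the
# terms in their titles and column names, and a keyword query is answered
# by intersecting posting lists.

from collections import defaultdict

datasets = {
    "d1": {"title": "city budget 2015", "columns": ["district", "spending"]},
    "d2": {"title": "school locations", "columns": ["district", "address"]},
}

index = defaultdict(set)
for ds_id, meta in datasets.items():
    for term in meta["title"].split() + meta["columns"]:
        index[term.lower()].add(ds_id)

def keyword_query(*terms):
    # Non-expert users type keywords; the system resolves them to datasets
    # instead of requiring a structured (e.g. SQL) query.
    postings = [index[t.lower()] for t in terms]
    return set.intersection(*postings) if postings else set()

print(keyword_query("district", "spending"))  # -> {'d1'}
```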

    Column-specific Context Extraction for Web Tables

    Relational Web tables have become an important resource for applications such as factual search and entity augmentation. A major challenge for the automatic identification of relevant tables on the Web is the fact that many of these tables have missing or non-informative column labels. Research has focused largely on recovering the meaning of columns by inferring class labels from the instances using external knowledge bases. The table context, which often contains additional information on the table's content, is frequently considered as an indicator for the general content of a table, but not as a source for column-specific details. In this paper, we propose a novel approach to identify and extract column-specific information from the context of Web tables. In our extraction framework, we consider different techniques to extract directly as well as indirectly related phrases. We perform a number of experiments on Web tables extracted from Wikipedia. The results show that the column-specific information extracted by our simple heuristic significantly boosts precision and recall for table and column search.
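
    The following sketch illustrates the general idea of column-specific context extraction with a deliberately simplified heuristic (not the paper's exact method): context sentences are assigned to a column when they mention its header directly or one of its instance values indirectly:

```python
# Simplified sketch of column-specific context extraction (illustrative
# heuristic, not the paper's method): phrases from the text surrounding a
# table are assigned to a column when they mention its header (directly
# related) or one of its cell values (indirectly related).

import re

def column_context(header, values, context_text):
    """Return sentences from the table context that mention the column
    header or one of its instance values."""
    sentences = re.split(r"(?<=[.!?])\s+", context_text)
    keywords = {header.lower()} | {v.lower() for v in values}
    return [s for s in sentences
            if any(k in s.lower() for k in keywords)]

context = ("The table lists German cities. Population figures are "
           "taken from the 2011 census. Dresden is the capital of Saxony.")
print(column_context("population", ["Dresden", "Leipzig"], context))
# -> the sentence mentioning the header and the sentence mentioning Dresden
```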

    Frontiers in Crowdsourced Data Integration

    There is an ever-increasing amount and variety of open web data available that is insufficiently examined or not considered at all in decision-making processes. This is due to the lack of end-user friendly tools that help to reuse this public data and to create knowledge out of it. We therefore propose a schema-optional data repository that provides the flexibility necessary to store and gradually integrate heterogeneous web data. Based on this repository, we propose a semi-automatic schema enrichment approach that efficiently augments the data in a “pay-as-you-go” fashion. Due to the ambiguities that inherently arise in this process, we further propose a crowd-based verification component that is able to resolve such conflicts in a scalable manner.
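
    The sketch below illustrates the “pay-as-you-go” interplay of automatic enrichment and crowd verification; the confidence threshold and the interfaces are assumptions for illustration, not the proposed system's API:

```python
# Conceptual sketch of pay-as-you-go schema enrichment with crowd-based
# verification (hypothetical interfaces): automatic matching proposes
# attribute mappings, and only low-confidence candidates go to the crowd.

CONFIDENCE_THRESHOLD = 0.9  # assumed cut-off for automatic acceptance

def enrich(candidate_mappings, ask_crowd):
    """candidate_mappings: list of (attribute, proposed_label, confidence).
    ask_crowd: callable posing a yes/no question to crowd workers."""
    accepted = []
    for attribute, label, confidence in candidate_mappings:
        if confidence >= CONFIDENCE_THRESHOLD:
            accepted.append((attribute, label))      # accept automatically
        elif ask_crowd(f"Does column '{attribute}' mean '{label}'?"):
            accepted.append((attribute, label))      # crowd-verified
    return accepted

# Usage with a stubbed crowd that approves everything it is asked:
mappings = [("pop", "population", 0.95), ("dob", "date of birth", 0.6)]
print(enrich(mappings, ask_crowd=lambda question: True))
# -> [('pop', 'population'), ('dob', 'date of birth')]
```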

    Top-k Entity Augmentation using Consistent Set Covering

    Entity augmentation is a query type in which, given a set of entities and a large corpus of possible data sources, the values of a missing attribute are to be retrieved. State-of-the-art methods return a single result that, in order to cover all queried entities, is fused from a potentially large set of data sources. We argue that queries over large corpora of heterogeneous sources using information retrieval and automatic schema matching methods cannot easily return a single result that the user can trust, especially if the result is composed from a large number of sources that the user has to verify manually. We therefore propose to process these queries in a Top-k fashion, in which the system produces multiple minimal consistent solutions from which the user can choose in order to resolve the uncertainty of the data sources and methods used. In this paper, we introduce and formalize the problem of consistent, multi-solution set covering, and present algorithms based on a greedy and a genetic optimization approach. We then apply these algorithms to Web table-based entity augmentation. The publication further includes a Web table corpus of 100M tables and a Web table retrieval and matching system in which these algorithms are implemented. Our experiments show that the consistency and minimality of the augmentation results can be improved using our set covering approach, without loss of precision or coverage, and while producing multiple alternative query results.
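
    The greedy variant of the multi-solution set covering idea can be sketched as follows; the scoring and the diversification penalty are simplifications assumed for illustration (consistency scoring between sources is omitted), not the paper's exact algorithm:

```python
# Sketch of greedy top-k set covering for entity augmentation (simplified):
# each solution covers all queried entities with few sources, and later
# solutions penalize already-used sources so the user receives genuinely
# alternative answers to choose from.

def greedy_cover(entities, sources, used_penalty, penalized):
    """sources: {source_id: set of entities covered}. Returns one cover."""
    uncovered, cover = set(entities), []
    while uncovered:
        # Score = newly covered entities, discounted if the source already
        # appeared in an earlier solution (diversifies the top-k results).
        best = max(sources, key=lambda s: len(sources[s] & uncovered)
                   - (used_penalty if s in penalized else 0))
        if not sources[best] & uncovered:
            return None  # the queried entities cannot be fully covered
        cover.append(best)
        uncovered -= sources[best]
    return cover

def top_k_covers(entities, sources, k, used_penalty=1):
    covers, penalized = [], set()
    for _ in range(k):
        cover = greedy_cover(entities, sources, used_penalty, penalized)
        if cover is None or cover in covers:
            break
        covers.append(cover)
        penalized.update(cover)
    return covers

sources = {"t1": {"a", "b"}, "t2": {"c"}, "t3": {"a", "b", "c"}}
print(top_k_covers({"a", "b", "c"}, sources, k=2))
# -> [['t3'], ['t1', 't2']]: two minimal, alternative covers
```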

    Publish-Time Data Integration for Open Data Platforms

    Platforms for publication and collaborative management of data, such as Data.gov or Google Fusion Tables, are a new trend on the web. They manage very large corpora of datasets, but often lack an integrated schema, ontology, or even just common publication standards. This results in inconsistent names for attributes of the same meaning, which constrains the discovery of relationships between datasets as well as their reusability. Existing data integration techniques focus on reuse-time, i.e., they are applied when a user wants to combine a specific set of datasets or integrate them with an existing database. In contrast, this paper investigates a novel method of data integration at publish-time, where the publisher is provided with suggestions on how to integrate the new dataset with the corpus as a whole, without resorting to a manually created mediated schema or ontology for the platform. We present data-driven algorithms that suggest alternative attribute names for a newly published dataset based on attribute and instance statistics maintained on the corpus. We evaluate the proposed algorithms using real-world corpora based on the Open Data platform opendata.socrata.com and relational data extracted from Wikipedia. We report on the system's response time and on the results of an extensive crowdsourcing-based evaluation of the quality of the generated attribute name alternatives.
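
    A minimal sketch of publish-time attribute name suggestion follows; the corpus statistics are invented for illustration. For a new column, corpus attributes are ranked by instance overlap, with ties broken by how frequent a name is on the platform:

```python
# Rough sketch of publish-time attribute name suggestion (the statistics
# here are invented): corpus attributes whose observed value sets overlap
# with a new column's values are ranked, preferring common names.

from collections import Counter

# Assumed corpus statistics: attribute name -> sample of observed values.
corpus = {
    "country": {"Germany", "France", "Italy", "Spain"},
    "nation":  {"Germany", "France", "Poland"},
    "city":    {"Berlin", "Paris", "Rome"},
}
name_frequency = Counter({"country": 120, "nation": 7, "city": 85})

def suggest_names(new_values, top_n=2):
    # Rank corpus attributes by instance overlap, breaking ties in favor
    # of names that are frequent on the platform (naming conventions).
    scored = [(len(values & new_values), name_frequency[name], name)
              for name, values in corpus.items()]
    scored.sort(reverse=True)
    return [name for overlap, _, name in scored[:top_n] if overlap > 0]

print(suggest_names({"Germany", "France", "Spain"}))
# -> ['country', 'nation']: 'country' wins on overlap and frequency
```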

    Building the Dresden Web Table Corpus: A Classification Approach

    In recent years, researchers have recognized relational tables on the Web as an important source of information. To support this research we developed the Dresden Web Table Corpus (DWTC), a collection of about 125 million data tables extracted from the Common Crawl (CC), which contains 3.6 billion web pages and is 266TB in size. As the vast majority of HTML tables are used for layout purposes and only a small share contains genuine tables with different surface forms, accurate table detection is essential for building a large-scale Web table corpus. Furthermore, correctly recognizing the table structure (e.g. horizontal listings, matrices) is important in order to understand the role of each table cell, distinguishing between label and data cells. In this paper, we present an extensive table layout classification that enables us to identify the main layout categories of Web tables with very high precision. To this end, we identify and develop a large set of table features, apply different feature selection techniques, and compare several classification algorithms. We evaluate the effectiveness of the selected features and the performance of various state-of-the-art classifiers. Finally, the winning approach is employed to classify millions of tables, resulting in the Dresden Web Table Corpus (DWTC).
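
    The classification pipeline can be sketched as follows; the concrete features and the decision tree are illustrative choices in the spirit of the paper, not its actual feature set or winning configuration:

```python
# Illustrative sketch of feature-based table layout classification: simple
# structural features are computed per table and fed to a standard
# classifier (here a decision tree; the paper compares several algorithms).

from sklearn.tree import DecisionTreeClassifier

def table_features(table):
    """table: list of rows, each a list of cell strings."""
    rows, cols = len(table), max(len(r) for r in table)
    cells = [c for row in table for c in row]
    # Share of numeric cells and average cell length help separate genuine
    # relational tables from layout tables.
    numeric_ratio = sum(c.replace(".", "", 1).isdigit() for c in cells) / len(cells)
    avg_cell_len = sum(len(c) for c in cells) / len(cells)
    return [rows, cols, numeric_ratio, avg_cell_len]

# Tiny hand-labeled training set: 1 = genuine relational table, 0 = layout.
train_tables = [
    [["Name", "Population"], ["Dresden", "556000"], ["Leipzig", "587000"]],
    [["Welcome to our site! Use the menu on the left to navigate."]],
]
X = [table_features(t) for t in train_tables]
y = [1, 0]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
candidate = [["City", "Area"], ["Berlin", "891.7"], ["Hamburg", "755.2"]]
print(clf.predict([table_features(candidate)]))  # -> [1], a genuine table
```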

    DeExcelerator: A Framework for Extracting Relational Data From Partially Structured Documents

    Of the structured data published on the web, for instance as datasets on Open Data platforms such as data.gov, but also in the form of HTML tables on the general web, only a small part is in a relational form. Instead, the data is intermingled with formatting, layout, and textual metadata, i.e., it is contained in partially structured documents. This makes transformation into a true relational form necessary, which is a precondition for most forms of data analysis and data integration. Studying data.gov as an example source of partially structured documents, we present a classification of typical normalization problems. We then present DeExcelerator, a framework for extracting relations from partially structured documents such as spreadsheets and HTML tables.
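
    The kind of normalization involved can be sketched with a few simple heuristics; these rules are illustrative only, not DeExcelerator's actual extraction logic:

```python
# Simplified sketch of relation extraction from a partially structured
# document: a spreadsheet-like grid with titles, blank lines, and footnotes
# is reduced to a header row plus uniform data tuples.

def extract_relation(grid):
    # Drop fully empty rows and single-cell rows (titles, footnotes, notes).
    rows = [r for r in grid if sum(1 for c in r if c.strip()) > 1]
    header, *data = rows
    # Keep tuples at the header's width; pad short rows with empty strings.
    width = len(header)
    return header, [(r + [""] * width)[:width] for r in data]

grid = [
    ["Budget Report 2015", ""],    # title line -> dropped
    ["", ""],                      # empty line -> dropped
    ["District", "Spending"],      # detected header
    ["Altstadt", "1.2M"],
    ["Neustadt", "0.9M"],
    ["Source: city council", ""],  # footnote -> dropped
]
header, tuples = extract_relation(grid)
print(header, tuples)
# -> ['District', 'Spending'] [['Altstadt', '1.2M'], ['Neustadt', '0.9M']]
```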

    cyclo-Tri-μ-oxido-tris{[(η5,η5)-1,2-bis(cyclopentadienyl)-1,1,2,2-tetramethyldisilane]zirconium(IV)}: a trimeric disila-bridged oxidozirconocene

    The title compound, [Zr3(C14H20Si2)3O3], consists of three disila-bridged zirconocene units connected via oxide ligands, forming a nearly planar six-membered ring with a maximum displacement of 0.0191 (8) Å. The compound was isolated as a by-product from a mixture of [(C5H4SiMe2)2ZrCl2] and Li[AlH4] in Et2O.